110 research outputs found

PolyHope: Two-Level Hope Speech Detection from Tweets

Author: Balouchzahi Fazlourrahman
Gelbukh Alexander
Sidorov Grigori
Publication date: 03/11/2022

Hope is characterized as openness of spirit toward the future: a desire, expectation, and wish for something to happen or to be true that remarkably affects a person's state of mind, emotions, behaviors, and decisions. Hope is usually associated with concepts of desired expectations and possibility/probability concerning the future. Despite its importance, hope has rarely been studied as a social media analysis task. This paper presents a hope speech dataset that classifies each tweet first into "Hope" and "Not Hope", then into three fine-grained hope categories: "Generalized Hope", "Realistic Hope", and "Unrealistic Hope" (along with "Not Hope"). English tweets from the first half of 2022 were collected to build this dataset. Furthermore, we describe our annotation process and guidelines in detail and discuss the challenges of classifying hope and the limitations of the existing hope speech detection corpora. In addition, we report several baselines based on different learning approaches, such as traditional machine learning, deep learning, and transformers, to benchmark our dataset. We evaluated our baselines using weighted-averaged and macro-averaged F1-scores. Observations show that a strict process for annotator selection and detailed annotation guidelines enhanced the dataset's quality. This strict annotation process resulted in promising performance for simple machine learning classifiers with only bi-grams; however, binary and multiclass hope speech detection results reveal that contextual embedding models perform better on this dataset. Comment: 20 pages, 9 figures
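The abstract above reports both weighted-averaged and macro-averaged F1-scores. As a minimal sketch of how the two averages differ (not the authors' evaluation code; the function names and the toy labels are assumptions made for illustration), macro F1 gives every class equal weight, while weighted F1 weights each class by its number of true instances:

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores weighted by class support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy, made-up labels: a classifier that always predicts the majority class.
y_true = ["Not Hope", "Not Hope", "Not Hope", "Hope"]
y_pred = ["Not Hope", "Not Hope", "Not Hope", "Not Hope"]
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

On imbalanced data, the macro average penalizes a classifier that ignores a minority class more heavily than the weighted average does, which is one reason both are commonly reported together.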

arXiv.org e-Print Archive

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Automatic Translation of Hate Speech to Non-hate Speech in Social Media Texts

Author: Kolesnikova Olga
Kostiuk Yevhen
Sidorov Grigori
Tonja Atnafu Lambebo
Publication date: 02/06/2023

In this paper, we investigate the issue of hate speech by presenting a novel task of translating hate speech into non-hate speech text while preserving its meaning. As a case study, we use Spanish texts. We provide a dataset and several baselines as a starting point for further research on the task. We evaluated our baseline results using multiple metrics, including BLEU scores. The aim of this study is to contribute to the development of more effective methods for reducing the spread of hate speech in online communities.

arXiv.org e-Print Archive

Creación y evaluación de un diccionario marcado con emociones y ponderado para el español [Creation and evaluation of a weighted, emotion-tagged dictionary for Spanish]

Author: DIAZ RANGEL ISMAEL
SIDOROV GRIGORI
SUAREZ GUERRA SERGIO
Publication venue: Carlos González Vergara
Publication date: 01/06/2014

This article presents a method for creating dictionaries marked with a specific value (for example, emotions or polarity) for use in various computer-based natural language processing tasks. In the resulting dictionary, the selected words are labeled with six basic emotions. To this end, the words were first analyzed (annotated) manually by multiple evaluators and then weighted automatically on the basis of those annotations. The method was applied to Spanish. The words in the dictionary were labeled with the basic emotional categories: joy, anger, fear, sadness, surprise, and disgust. Unlike other machine-readable dictionaries, the proposed dictionary contains weights (percentages giving the probability that a word is used with an emotional sense). Each word was rated by multiple evaluators, and an agreement analysis was then carried out using the weighted kappa method, adapted for multiple evaluators. Based on the results, a measure was proposed that estimates how frequent the affective use of a word is: the probability factor of affective use (FPA, from the Spanish "factor de probabilidad de uso afectivo"), which serves to give potentially emotional words a weighting factor. The FPA can be included as information in automatic systems, for example, for sentiment detection in text. The FPA refers to the tendency in how each word is used; it is not an absolute characteristic. It is thus useful for automatic systems.

Repositorio Institucional de la Universidad Autónoma del Estado de México

UrduFake@FIRE2021: Shared Track on Fake News Identification in Urdu

Author: Amjad Hamza Imam
Amjad Maaz
Butt Sabur
Gelbukh Alexander
Sidorov Grigori
Zhila Alisa
Publication date: 11/07/2022

This study reports on UrduFake@FIRE2021, the second shared task on fake news detection in the Urdu language. It is a binary classification problem in which a given news article is classified as either (i) real news or (ii) fake news. In this shared task, 34 teams from 7 different countries (China, Egypt, Israel, India, Mexico, Pakistan, and UAE) registered to participate, 18 teams submitted their experimental results, and 11 teams submitted their technical reports. The proposed systems were based on various count-based features and used different classifiers as well as neural network architectures. The stochastic gradient descent (SGD) algorithm outperformed other classifiers and achieved an F-score of 0.679.

arXiv.org e-Print Archive

Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021

Author: Amjad Hamza Imam
Amjad Maaz
Butt Sabur
Gelbukh Alexander
Sidorov Grigori
Zhila Alisa
Publication date: 11/07/2022

Automatic detection of fake news is a highly important task in the contemporary world. This study reports on the second shared task, UrduFake@FIRE2021, on fake news detection in Urdu. The goal of the shared task is to motivate the community to come up with efficient methods for solving this vital problem, particularly for the Urdu language. The task is posed as a binary classification problem: label a given news article as real or fake. The organizers provide a dataset comprising news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business, split into training and testing sets. The training set contains 1300 annotated news articles (750 real, 550 fake), while the testing set contains 300 news articles (200 real, 100 fake). 34 teams from 7 different countries (China, Egypt, Israel, India, Mexico, Pakistan, and UAE) registered to participate in the UrduFake@FIRE2021 shared task. Of those, 18 teams submitted their experimental results, and 11 of those submitted their technical reports, substantially more than in the UrduFake shared task in 2020, when only 6 teams submitted technical reports. The technical reports submitted by the participants demonstrated different data representation techniques, ranging from count-based BoW features to word vector embeddings, as well as the use of numerous machine learning algorithms, ranging from traditional SVM to various neural network architectures including Transformers such as BERT and RoBERTa. In this year's competition, the best performing system obtained an F1-macro score of 0.679, which is lower than the past year's best result of 0.907 F1-macro. Admittedly, while the training sets from the past and current years overlap to a large extent, the testing set provided this year is completely different.

arXiv.org e-Print Archive

Editorial: Text complexity and simplification

Author: Alexander Gelbukh
Grigori Sidorov
Liana Ermakova
Valery Solovyev
Publication venue: Frontiers Media SA
Publication date: 01/06/2023

Directory of Open Access Journals

The Role of Emotions in Native Language Identification

Author: Carlo Strapparava
Grigori Sidorov
Ilia Markov
Vivi Nastase
Publication date: 01/01/2018

Crossref

Archivio della ricerca - Fondazione Bruno Kessler

Open Access Repository

Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models

Author: Gelbukh Alexander
Kalita Jugal
Kolesnikova Olga
Nigatu Hellina Hailu
Sidorov Grigori
Tonja Atnafu Lambebo
Publication date: 27/05/2023

This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present the system descriptions for three methods. We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki NLP Spanish-English translation model, and experimented with different transfer learning setups. We experimented with 11 languages from the Americas and report the setups we used as well as the results we achieved. Overall, the mBART setup was able to improve upon the baseline for three out of the eleven languages. Comment: Accepted to Third Workshop on NLP for Indigenous Languages of the Americas

arXiv.org e-Print Archive

Adapting Cross-Genre Author Profiling to Language and Corpus: Notebook for PAN at CLEF 2016

Author: Alexander Gelbukh
Grigori Sidorov
Helena Gómez-Adorno
Ilia Markov
Publication date: 11/04/2020

This paper presents our approach to the Author Profiling (AP) task at PAN 2016. The task aims at identifying the author's age and gender under cross-genre AP conditions in three languages: English, Spanish, and Dutch. Our preprocessing stage includes reducing non-textual features to their corresponding semantic classes. We exploit typed character n-grams, lexical features, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, second order attributes (SOA), tf-idf) and machine learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, logistic regression). For textual feature selection, we applied the transition point technique, except when SOA was used. We found that the optimal configuration was different for different languages at each stage.
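The abstract above lists several feature representations built on character n-grams. The following is a minimal sketch of three of them (binary, raw frequency, normalized frequency), assuming plain, untyped character n-grams; the typed n-gram categories, SOA, tf-idf, and the transition point technique used in the paper are not reproduced here, and the function names and the `scheme` parameter are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def represent(text, n=3, scheme="raw"):
    """Map each character n-gram in text to a feature value.

    scheme="binary":     1 if the n-gram occurs at all
    scheme="raw":        number of occurrences
    scheme="normalized": occurrences divided by the total n-gram count
    """
    counts = Counter(char_ngrams(text, n))
    if scheme == "binary":
        return {gram: 1 for gram in counts}
    if scheme == "raw":
        return dict(counts)
    if scheme == "normalized":
        total = sum(counts.values())
        return {gram: c / total for gram, c in counts.items()} if total else {}
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("binary", "raw", "normalized"):
    print(scheme, represent("banana", n=3, scheme=scheme))
```

Each scheme yields a sparse feature dictionary that could be passed to a vectorizer and then to a linear classifier; which representation works best is an empirical question that, per the abstract, had different answers for different languages.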

CiteSeerX